FROM FINITE SAMPLE TO ASYMPTOTICS: A GEOMETRIC
BRIDGE FOR SELECTION CRITERIA IN SPLINE REGRESSION

By S. C. Kou 

Harvard University 

This paper studies, under the setting of spline regression, the
connection between finite-sample properties of selection criteria and
their asymptotic counterparts, focusing on bridging the gap between
the two. We introduce a bias-variance decomposition of the prediction
error, using which it is shown that in the asymptotics the bias term
dominates the variability term, providing an explanation of the gap.
A geometric exposition is provided for intuitive understanding. The 
theoretical and geometric results are illustrated through a numerical 
example. 



1. Introduction. A central problem in statistics is regression: One
observes $\{(x_i, y_i), i = 1, 2, \ldots, n\}$ and wants to estimate the regression
function of $y$ on $x$. Through the efforts of many authors, the past two decades
have witnessed the establishment of nonparametric regression as a powerful
tool for data analysis; references include, for example, Härdle (1990),
Hastie and Tibshirani (1990), Wahba (1990), Silverman (1985), Rosenblatt
(1991), Green and Silverman (1994), Eubank (1988), Simonoff (1996), Fan
and Gijbels (1996), Bowman and Azzalini (1997) and Fan (2000).

The practical application of nonparametric regression typically requires 
the specification of a smoothing parameter which crucially determines how 
locally the smoothing is done. This article, under the setting of smoothing 
splines, concerns the data-driven choice of smoothing parameter (as opposed 
to a subjective selection); in particular, this article focuses on the connection 
between finite-sample properties of selection criteria and their asymptotic 
counterparts.

The large-sample (asymptotic) perspective has been impressively addressed
in the literature. Some references, among others, include Wahba (1985), 
Li (1986, 1987), Stein (1990), Hall and Johnstone (1992), Jones, Marron 
and Sheather (1996), Hurvich, Simonoff and Tsai (1998) and Speckman and 
Sun (2001). 

Complementary to the large-sample (asymptotic) developments, Efron 
(2001) and Kou and Efron (2002), using a geometric interpretation of selection
criteria, study the finite-sample properties. For example, they explain
(a) why the popular Cp criterion has the tendency to be highly variable [even 
for data sets generated from the same underlying curve, the Cp-estimated 
curve varies a lot from oversmoothed ones to very wiggly ones; see Kohn, 
Ansley and Tharm (1991) and Hurvich, Simonoff and Tsai (1998) for exam- 
ples], and (b) why another selection criterion, generalized maximum likeli- 
hood [Wecker and Ansley (1983), Wahba (1985) and Stein (1990)], appears 
to be stable and yet sometimes tends to undersmooth the curve. Roughly 
speaking, it was shown that the root of the variable behavior of Cp is its 
geometric instability, while the stable but undersmoothing behavior of gen- 
eralized maximum likelihood (GML) stems from its potentially large bias. 
In addition, they also introduce a new selection criterion, the extended ex- 
ponential (EE) criterion, which combines the strength of Cp and GML while 
mitigating their weaknesses. 

With the asymptotic and finite-sample properties delineated, it seems 
that we have a "complete" picture of selection criteria. However, a careful 
inspection of the finite-sample and asymptotic results, especially the ones 
comparing Cp and GML, reveals an interesting gap. On the finite-sample 
side, Cp's geometric instability undermines its competitiveness [Kohn, Ans- 
ley and Tharm (1991) and Hurvich, Simonoff and Tsai (1998)], which opens 
the door for the more stable GML, while on the large-sample (asymptotic) 
side different authors [e.g., Wahba (1985) and Li (1986, 1987)] have suggested 
that from the frequentist standpoint the Cp-type criterion asymptotically 
performs more efficiently than GML. This "gap" between finite-sample and
asymptotic results naturally raises two puzzles: (a) Why doesn't the
finite-sample advantage of GML, notably its stability, benefit it as far as
large-sample (asymptotic) performance is concerned? (b) Why does the geometric
instability of Cp seen in finite samples disappear in the asymptotic considerations?

This article attempts to address these puzzles. First, by decomposing the 
estimation error into a bias part and a variability part, we show that as 
sample size grows large the bias term dominates the variability term, thus 
making the large-sample case virtually a bias problem. Consequently in the 
large-sample comparisons, one is essentially comparing the bias of different 
selection criteria and unintentionally overlooking the variability — a situation 
particularly favoring the Cp-type criterion as it is (asymptotically) unbiased. 
Second, by studying the evolution of the geometry of selection criteria, we
show that the geometric instability of selection criteria gradually decreases,
though rather slowly, which again benefits the Cp-type criterion, because it
means that, as far as asymptotics is concerned, the instability of Cp evident in
finite-sample studies will not show up. The recent interesting work of Speckman
and Sun (2001) appears to confirm our results regarding asymptotics (see 
Section 2); they showed that GML and Cp agree on the relative convergence 
rate of the selected smoothing parameter. 

The connection between finite-sample and asymptotic results is illustrated 
by a numerical example (Section 4). The numerical example also indicates 
that for sample sizes one usually encounters in practice the EE criterion 
appears to behave more stably than both GML and Cp. 

The article is organized as follows. Section 2 introduces a bias-variance 
decomposition of the total prediction error, and investigates its finite- and 
large-sample consequences. Section 3 provides a geometric explanation to 
bridge the finite-sample and asymptotic results regarding selection criteria. 
Section 4 illustrates the connection through a simulation experiment. The 
article concludes in Section 5 with further remarks. The detailed theoretical 
proofs are deferred to the Appendix. 

2. A bias-variance decomposition for prediction error. 

2.1. Selection criteria in spline regression. The goal of regression is to
estimate $f(x) = E(y|x)$ from $n$ observed data points $\{(x_i, y_i), i = 1, 2, \ldots, n\}$.
A linear smoother estimates $\mathbf{f} = (f(x_1), f(x_2), \ldots, f(x_n))'$ by $\hat{\mathbf{f}}_\lambda = \mathbf{A}_\lambda\mathbf{y}$, where
the entries of the $n \times n$ smoothing matrix $\mathbf{A}_\lambda$ depend on $\mathbf{x} = (x_1, x_2, \ldots, x_n)$
and also on a nonnegative smoothing parameter $\lambda$. One class of linear smoothers
that will be of particular interest in this article is (cubic) smoothing splines,
under which

\[
\mathbf{A}_\lambda = \mathbf{U}\mathbf{a}_\lambda\mathbf{U}', \tag{2.1}
\]

where $\mathbf{U}$ is an $n \times n$ orthogonal matrix not depending on $\lambda$, and $\mathbf{a}_\lambda =
\mathrm{diag}(a_{\lambda i})$, a diagonal matrix with the $i$th diagonal element $a_{\lambda i} = 1/(1 + \lambda k_i)$,
$i = 1, 2, \ldots, n$. The constants $\mathbf{k} = (k_1, k_2, \ldots, k_n)$, solely determined by $\mathbf{x}$, are
nonnegative and nondecreasing. The trace of the smoothing matrix $\mathrm{tr}(\mathbf{A}_\lambda)$
is referred to as the "degrees of freedom," $df_\lambda = \mathrm{tr}(\mathbf{A}_\lambda)$, which agrees with
the standard definition if $\mathbf{A}_\lambda$ represents polynomial regression.

To use splines in practice, one typically has to infer the value of the
smoothing parameter $\lambda$ from the data. The Cp criterion chooses $\lambda$ to minimize
an unbiased estimate of the total squared error. Suppose the $y_i$'s are
uncorrelated, with mean $f_i$ and constant variance $\sigma^2$. The Cp estimate of
$\lambda$ is $\hat\lambda^{C_p} = \arg\min_\lambda\{C_\lambda(\mathbf{y})\}$, where the Cp statistic $C_\lambda(\mathbf{y}) = \|\mathbf{y} - \hat{\mathbf{f}}_\lambda\|^2 +
2\sigma^2\,\mathrm{tr}(\mathbf{A}_\lambda) - n\sigma^2$ is an unbiased estimate of $E\|\hat{\mathbf{f}}_\lambda - \mathbf{f}\|^2$.

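
The shrinkage factors and the degrees of freedom attached to (2.1) can be sketched numerically. The snippet below is a minimal illustration rather than the paper's computation: the eigenvalue sequence `k` is a hypothetical stand-in (two zeros followed by quartic growth), chosen only so that the code runs without constructing a spline basis.

```python
import math

def shrinkage(lam, k):
    """Diagonal entries a_i = 1 / (1 + lam * k_i) of the smoother in the rotated basis."""
    return [1.0 / (1.0 + lam * ki) for ki in k]

def degrees_of_freedom(lam, k):
    """df = tr(A) = sum_i a_i, because U is orthogonal."""
    return sum(shrinkage(lam, k))

# Hypothetical eigenvalues (an assumption for illustration): nonnegative and
# nondecreasing, with two zeros followed by quartic growth.
n = 50
k = [0.0, 0.0] + [math.pi ** 4 * (i - 1.5) ** 4 / n for i in range(3, n + 1)]

df_no_smoothing = degrees_of_freedom(0.0, k)       # every a_i equals 1, so df = n
df_heavy_smoothing = degrees_of_freedom(1e12, k)   # only the two zero eigenvalues survive
```

With these assumed eigenvalues the degrees of freedom fall from $n$ toward 2 as $\lambda$ grows, which is the sense in which $\lambda$ controls how locally the smoothing is done.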
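
The generalized maximum likelihood (GML) criterion [Wecker and Ansley
(1983)] is another selection criterion motivated from empirical Bayes
considerations. If one starts from $\mathbf{y} \sim N(\mathbf{f}, \sigma^2\mathbf{I})$, and puts a Gaussian prior on
the underlying curve: $\mathbf{f} \sim N(\mathbf{0}, \sigma^2\mathbf{A}_\lambda(\mathbf{I} - \mathbf{A}_\lambda)^{-1})$, then by Bayes theorem,

\[
\mathbf{y} \sim N(\mathbf{0}, \sigma^2(\mathbf{I} - \mathbf{A}_\lambda)^{-1}), \qquad \mathbf{f}\,|\,\mathbf{y} \sim N(\mathbf{A}_\lambda\mathbf{y}, \sigma^2\mathbf{A}_\lambda). \tag{2.2}
\]

The second relationship shows that $\hat{\mathbf{f}}_\lambda = \mathbf{A}_\lambda\mathbf{y}$ is the Bayes estimate of $\mathbf{f}$.
The first relationship motivates the GML: It chooses $\hat\lambda^{GML}$ as the MLE of
$\lambda$ from $\mathbf{y} \sim N(\mathbf{0}, \sigma^2(\mathbf{I} - \mathbf{A}_\lambda)^{-1})$.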

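The setting of smoothing splines (2.1) allows a rotation of coordinates,

\[
\mathbf{z} = \mathbf{U}'\mathbf{y}/\sigma, \qquad \mathbf{g} = \mathbf{U}'\mathbf{f}/\sigma, \qquad \hat{\mathbf{g}}_\lambda = \mathbf{U}'\hat{\mathbf{f}}_\lambda/\sigma, \tag{2.3}
\]

which leads to a diagonal form: $\mathbf{z} \sim N(\mathbf{g}, \mathbf{I})$, $\hat{\mathbf{g}}_\lambda = \mathbf{a}_\lambda\mathbf{z}$. Let $b_{\lambda i} = 1 - a_{\lambda i}$ and
$\mathbf{b}_\lambda = (b_{\lambda 1}, b_{\lambda 2}, \ldots, b_{\lambda n})$. In the new coordinate system, the Cp statistic can
be expressed as a function of $\mathbf{z}^2$, $C_\lambda(\mathbf{z}^2) = \sigma^2\sum_{i=1}^n (b_{\lambda i}^2 z_i^2 - 2b_{\lambda i}) + n\sigma^2$, and
correspondingly

\[
\hat\lambda^{C_p} = \arg\min_\lambda \sum_{i>2} (b_{\lambda i}^2 z_i^2 - 2b_{\lambda i}).
\]

Under the coordinate system of $\mathbf{z}$ and $\mathbf{g}$, since $\mathbf{z} \sim N(\mathbf{0}, \mathrm{diag}(\mathbf{b}_\lambda^{-1}))$, $\mathbf{g}\,|\,\mathbf{z} \sim
N(\mathbf{a}_\lambda\mathbf{z}, \mathbf{a}_\lambda)$,

\[
\hat\lambda^{GML} = \text{MLE of } \mathbf{z} \sim N(\mathbf{0}, \mathrm{diag}(\mathbf{b}_\lambda^{-1})) = \arg\min_\lambda \sum_{i>2} (b_{\lambda i} z_i^2 - \log b_{\lambda i}).
\]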

Because z and g offer simpler expressions, we will work on them instead 
of y and f whenever possible. The extended exponential (EE) selection cri- 
terion, studied in Kou and Efron (2002), provides a third way to choose the 
smoothing parameter. It is motivated by the idea of combining the strengths 
of Cp and GML while mitigating their weaknesses, since in practice the Cp- 
selected smoothing parameter tends to be highly variable, whereas the GML 
criterion has a serious problem with bias (see Section 4 for an illustration). 
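Expressed in terms of $\mathbf{z}$, the EE criterion selects the smoothing parameter
$\lambda$ according to

\[
\hat\lambda^{EE} = \arg\min_\lambda \sum_{i>2} \left[ C\, b_{\lambda i}\, z_i^{4/3} - 3\, b_{\lambda i}^{1/3} \right],
\]

where the constant $C = \sqrt{\pi}/(2^{2/3}\Gamma(7/6)) \approx 1.203$. Kou and Efron (2002) explained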
its construction from a geometric point of view and illustrated through a 
finite-sample nonasymptotic analysis that the EE criterion combines the 
strengths of Cp and GML to a large extent.

An interesting fact about the three criteria (Cp, GML and EE) is that
they share a unified structure. Let $p \ge 1$, $q \ge 1$ be two fixed constants. Define
the function

\[
l_\lambda^{(p,q)}(\mathbf{u}) =
\begin{cases}
\sum_i \left[ c_q b_{\lambda i}^{p/q} u_i - \frac{p}{p-1}\left( b_{\lambda i}^{(p-1)/q} - 1 \right) \right], & \text{if } p > 1, \\
\sum_i \left( c_q b_{\lambda i}^{1/q} u_i - \log b_{\lambda i}^{1/q} \right), & \text{if } p = 1,
\end{cases}
\tag{2.4}
\]

where $c_q = \sqrt{\pi}/(2^{1/q}\Gamma(1/2 + 1/q))$, and the corresponding selection criterion

\[
\hat\lambda^{(p,q)} = \arg\min_\lambda \{ l_\lambda^{(p,q)}(\mathbf{z}^{2/q}) \}. \tag{2.5}
\]

Then it is easy to verify that (i) $l_\lambda^{(p,q)} \to l_\lambda^{(1,q)}$ as $p \to 1$; (ii) taking $p = 1$,
$q = 1$ gives the GML criterion; $p = 2$, $q = 1$ gives the Cp criterion; $p = q = 3/2$
gives the EE criterion. The class (2.5), therefore, unites the three criteria in
a continuous fashion. This greatly facilitates our theoretical development as
it allows us to work on the general selection criterion $\hat\lambda^{(p,q)}$ and take $(p, q)$
to specific values to obtain corresponding results for EE, Cp and GML.
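
The unified class can be made concrete with a short sketch. The code below is an illustration under stated assumptions, not the author's implementation: the function names, the grid search, and the toy inputs are hypothetical, coordinates with $k_i = 0$ are skipped (there $b_{\lambda i} = 0$), and the $p > 1$ branch is written with an additive constant that leaves the minimizer unchanged so that the $p = 1$ branch is its limit.

```python
import math

def c_q(q):
    """Normalizing constant c_q = sqrt(pi) / (2**(1/q) * Gamma(1/2 + 1/q))."""
    return math.sqrt(math.pi) / (2.0 ** (1.0 / q) * math.gamma(0.5 + 1.0 / q))

def criterion(lam, z, k, p, q):
    """Evaluate the (p, q) criterion at lam > 0, summing over coordinates with k_i > 0."""
    total = 0.0
    for zi, ki in zip(z, k):
        if ki <= 0.0:                      # b_i = 0 there, so the term is skipped
            continue
        b = lam * ki / (1.0 + lam * ki)    # b_i = 1 - a_i
        u = abs(zi) ** (2.0 / q)           # u_i = z_i^(2/q)
        if p == 1:
            total += c_q(q) * b ** (1.0 / q) * u - math.log(b) / q
        else:
            total += c_q(q) * b ** (p / q) * u - p / (p - 1.0) * (b ** ((p - 1.0) / q) - 1.0)
    return total

def select_lambda(z, k, p, q, grid):
    """Grid-search stand-in for the argmin over lambda."""
    return min(grid, key=lambda lam: criterion(lam, z, k, p, q))

# (p, q) pairs named in the text.
CRITERIA = {"GML": (1.0, 1.0), "Cp": (2.0, 1.0), "EE": (1.5, 1.5)}

# Toy inputs (hypothetical): rotated observations z and eigenvalues k.
z = [0.3, -1.2, 2.0, 0.7]
k = [0.0, 0.0, 1.0, 4.0]
grid = [0.1, 1.0, 10.0]
selected = {name: select_lambda(z, k, p, q, grid) for name, (p, q) in CRITERIA.items()}
```

A grid search is used only for transparency; any one-dimensional minimizer over $\lambda$ could be substituted.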

2.2. The unbiasedness of Cp. To introduce the idea of bias-variance
decomposition, we first note that for each selection criterion $\hat\lambda^{(p,q)}$ there are an
associated central smoothing parameter $\lambda_c^{(p,q)}$ and central degrees of
freedom $df_c^{(p,q)}$ obtained by applying the expectation operator on the selection
criterion (2.5):

\[
\lambda_c^{(p,q)} = \arg\min_\lambda E\{ l_\lambda^{(p,q)}(\mathbf{z}^{2/q}) \}, \tag{2.6}
\]

\[
df_c^{(p,q)} = \mathrm{tr}(\mathbf{A}_{\lambda_c^{(p,q)}}). \tag{2.7}
\]

Since (2.6) is the estimating-equation version of (2.5), from the general theory
of estimating equations it can be seen that $\hat\lambda^{(p,q)}$ and $\widehat{df}^{(p,q)}$ are centered
around $\lambda_c^{(p,q)}$ and $df_c^{(p,q)}$ in the sense that $\lambda_c^{(p,q)}$ and $df_c^{(p,q)}$ are the asymptotic
means of $\hat\lambda^{(p,q)}$ and $\widehat{df}^{(p,q)}$. Thus $\lambda_c^{(p,q)}$ and $df_c^{(p,q)}$ index the central
tendency of the selection criterion-$(p, q)$.

Next we introduce the ideal smoothing parameter $\lambda_0$ and the ideal degrees
of freedom $df_0 = \mathrm{tr}(\mathbf{A}_{\lambda_0})$, which are intrinsically determined by the
underlying curve and do not depend on the specific selection criterion one
uses:

\[
\lambda_0 = \arg\min_\lambda E_{\mathbf{f}}\|\hat{\mathbf{f}}_\lambda - \mathbf{f}\|^2 = \arg\min_\lambda E\|\hat{\mathbf{g}}_\lambda - \mathbf{g}\|^2. \tag{2.8}
\]

The risk $E\|\hat{\mathbf{g}}_{\lambda_0} - \mathbf{g}\|^2$ associated with $\lambda_0$ represents the minimum risk one
has to bear to estimate the underlying curve. Therefore, to compare the
performance of different selection criteria one can focus on the extra risk:
$E\|\hat{\mathbf{g}}_{\hat\lambda} - \mathbf{g}\|^2 - E\|\hat{\mathbf{g}}_{\lambda_0} - \mathbf{g}\|^2$. See Wahba (1985), Härdle, Hall and Marron
(1988), Hall and Johnstone (1992), Gu (1998) and Efron (2001) for more
discussion.

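Having introduced the necessary concepts, we state our first result, the
unbiasedness of Cp.

Theorem 2.1. The central smoothing parameter $\lambda_c^{(2,1)}$ and degrees of
freedom $df_c^{(2,1)}$ of Cp correspond exactly to the ideal smoothing parameter
and degrees of freedom: $\lambda_c^{(2,1)} = \lambda_0$ and $df_c^{(2,1)} = df_0$.

Proof. First, from the definition (2.8) a straightforward expansion gives

\[
\lambda_0 = \arg\min_\lambda \sum_i \left( b_{\lambda i}^2 (g_i^2 + 1) - 2 b_{\lambda i} \right). \tag{2.9}
\]

Next, for Cp according to (2.6) its central smoothing parameter

\[
\begin{aligned}
\lambda_c^{(2,1)} &= \arg\min_\lambda E\{ l_\lambda^{(2,1)}(\mathbf{z}^2) \} = \arg\min_\lambda E\Big\{ \sum_i \left[ b_{\lambda i}^2 z_i^2 - 2 b_{\lambda i} \right] \Big\} \\
&= \arg\min_\lambda \sum_i \left( b_{\lambda i}^2 (g_i^2 + 1) - 2 b_{\lambda i} \right).
\end{aligned}
\tag{2.10}
\]

The proof is complete because (2.9) and (2.10) give identical expressions for
$\lambda_0$ and $\lambda_c^{(2,1)}$. □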
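
The argument of the proof can be checked numerically: for $\mathbf{z} \sim N(\mathbf{g}, \mathbf{I})$ the risk and the expected Cp objective differ by the constant $n$, so they share a minimizer. The sketch below uses a hypothetical signal `g`, hypothetical eigenvalues `k`, and a grid over $\lambda$; none of these values come from the paper.

```python
def b_factor(lam, ki):
    """b_i = lam * k_i / (1 + lam * k_i) = 1 - a_i."""
    return lam * ki / (1.0 + lam * ki)

def risk(lam, g, k):
    """E||a z - g||^2 = sum_i [(1 - b_i)^2 + b_i^2 g_i^2] when z ~ N(g, I)."""
    return sum((1.0 - b_factor(lam, ki)) ** 2 + b_factor(lam, ki) ** 2 * gi ** 2
               for gi, ki in zip(g, k))

def expected_cp(lam, g, k):
    """E sum_i [b_i^2 z_i^2 - 2 b_i], using E z_i^2 = g_i^2 + 1."""
    return sum(b_factor(lam, ki) ** 2 * (gi ** 2 + 1.0) - 2.0 * b_factor(lam, ki)
               for gi, ki in zip(g, k))

# Hypothetical inputs, for illustration only.
g = [3.0, -2.0, 1.5, 0.8, 0.3, 0.1]
k = [0.0, 0.0, 1.0, 5.0, 20.0, 60.0]
grid = [10.0 ** (e / 10.0) for e in range(-40, 21)]

gaps = [risk(lam, g, k) - expected_cp(lam, g, k) for lam in grid]   # each equals len(g)
lambda_0 = min(grid, key=lambda lam: risk(lam, g, k))
lambda_c_cp = min(grid, key=lambda lam: expected_cp(lam, g, k))
```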

Since no other element from the selection criteria class (2.5) possesses this 
property of unbiasedness, the result of Theorem 2.1 gives Cp an advantage 
over the others. As we shall see shortly, this advantage is the main factor 
that makes the asymptotic consideration favorable for Cp. 

2.3. The bias-variance decomposition. The results developed so far hold
for all sample sizes. Next we turn our attention to the large-sample case.
There is a large literature addressing the large-sample properties
of selection criteria. The well-cited asymptotic results [Wahba (1985) and Li
(1986, 1987), among others] suggest that as far as large samples are concerned
the Cp-type criterion outperforms GML. This, interestingly, seems at odds
with the well-known finite-sample results. For example, Kohn, Ansley and
Tharm (1991) and Hurvich, Simonoff and Tsai (1998), among others, illustrate
that in finite samples the Cp criterion has a strong tendency for high
variability in the sense that even for data sets generated from the same
underlying curve the Cp-estimated curves vary a great deal from oversmoothed
ones to very wiggly ones, which contrasts with the stably performing GML.
To understand why there is this gap between finite- and large-sample results,
we will provide a bias-variance decomposition of the prediction error, based
on which it will be seen that the major reason is that the large-sample
consideration virtually only looks at the bias, as bias asymptotically dominates
variability.

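The central smoothing parameter and central degrees of freedom defined
previously pave the way for the bias-variance decomposition. Consider the
prediction error for estimating the curve $E\|\hat{\mathbf{f}}_{\hat\lambda^{(p,q)}} - \mathbf{f}\|^2$, which is equal to
$\sigma^2 E\|\hat{\mathbf{g}}_{\hat\lambda^{(p,q)}} - \mathbf{g}\|^2$ according to (2.3). We can write

\[
\begin{aligned}
E\|\hat{\mathbf{g}}_{\hat\lambda^{(p,q)}} - \mathbf{g}\|^2
&= E\|(\hat{\mathbf{g}}_{\hat\lambda^{(p,q)}} - \hat{\mathbf{g}}_{\lambda_c^{(p,q)}}) + (\hat{\mathbf{g}}_{\lambda_c^{(p,q)}} - \mathbf{g})\|^2 \\
&= E\|\hat{\mathbf{g}}_{\lambda_c^{(p,q)}} - \mathbf{g}\|^2 + 2E(\hat{\mathbf{g}}_{\lambda_c^{(p,q)}} - \mathbf{g})'(\hat{\mathbf{g}}_{\hat\lambda^{(p,q)}} - \hat{\mathbf{g}}_{\lambda_c^{(p,q)}}) + E\|\hat{\mathbf{g}}_{\hat\lambda^{(p,q)}} - \hat{\mathbf{g}}_{\lambda_c^{(p,q)}}\|^2.
\end{aligned}
\]

Consequently, the extra risk beyond the unavoidable risk $E\|\hat{\mathbf{g}}_{\lambda_0} - \mathbf{g}\|^2$ can
be written as

\[
\begin{aligned}
&E\|\hat{\mathbf{g}}_{\hat\lambda^{(p,q)}} - \mathbf{g}\|^2 - E\|\hat{\mathbf{g}}_{\lambda_0} - \mathbf{g}\|^2 \\
&\quad = \left( E\|\hat{\mathbf{g}}_{\lambda_c^{(p,q)}} - \mathbf{g}\|^2 - E\|\hat{\mathbf{g}}_{\lambda_0} - \mathbf{g}\|^2 \right) \\
&\qquad + 2E(\hat{\mathbf{g}}_{\lambda_c^{(p,q)}} - \mathbf{g})'(\hat{\mathbf{g}}_{\hat\lambda^{(p,q)}} - \hat{\mathbf{g}}_{\lambda_c^{(p,q)}}) + E\|\hat{\mathbf{g}}_{\hat\lambda^{(p,q)}} - \hat{\mathbf{g}}_{\lambda_c^{(p,q)}}\|^2.
\end{aligned}
\tag{2.11}
\]

This expression provides a bias-variance decomposition for the prediction
error. The first term $E\|\hat{\mathbf{g}}_{\lambda_c^{(p,q)}} - \mathbf{g}\|^2 - E\|\hat{\mathbf{g}}_{\lambda_0} - \mathbf{g}\|^2$ can be viewed as the bias
term: it captures the error of estimating the curve $\mathbf{g}$ beyond the unavoidable
risk by using the central smoothing parameter $\lambda_c^{(p,q)}$, that is, it measures
the discrepancy between the central risk associated with $\lambda_c^{(p,q)}$ and the ideal
minimum risk. The third term $E\|\hat{\mathbf{g}}_{\hat\lambda^{(p,q)}} - \hat{\mathbf{g}}_{\lambda_c^{(p,q)}}\|^2$ can be viewed as the
variability term: it measures the variability of $\hat{\mathbf{g}}_{\hat\lambda^{(p,q)}}$ from its "center" $\hat{\mathbf{g}}_{\lambda_c^{(p,q)}}$.
The second term, the covariance, arises here due to the nature of adaptation
(the smoothing parameter itself is also inferred from the data, in addition
to estimating the curve).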

Clearly, for any practical finite-sample problem, each term in (2.11)
contributes to the squared prediction error. However, we shall show that as the
sample size $n$ grows large the bias term gradually dominates the other two.
To focus on the basic idea, without loss of generality, we assume the design
points $(x_1, x_2, \ldots, x_n)$ are $n$ equally spaced points along the interval $[0, 1]$.
Section 5 will discuss the setting of general design points.

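In what follows, to avoid cumbersome notation, we will write $\hat\lambda$ for $\hat\lambda^{(p,q)}$,
$\widehat{df}$ for $\widehat{df}^{(p,q)}$, $\lambda_c$ for $\lambda_c^{(p,q)}$, $df_c$ for $df_c^{(p,q)}$ and so on. The full notation $\hat\lambda^{(p,q)}$,
$\lambda_c^{(p,q)}$, $df_c^{(p,q)}$ will be used whenever potential confusion might arise. Consider
the bias term $E\|\hat{\mathbf{g}}_{\lambda_c} - \mathbf{g}\|^2 - E\|\hat{\mathbf{g}}_{\lambda_0} - \mathbf{g}\|^2$ first:

\[
\begin{aligned}
E\|\hat{\mathbf{g}}_{\lambda_c} - \mathbf{g}\|^2 = E\|\mathbf{a}_{\lambda_c}\mathbf{z} - \mathbf{g}\|^2
&= \sum_{i=1}^n \left( b_{\lambda_c i}^2 g_i^2 + a_{\lambda_c i}^2 \right) \\
&= \lambda_c \sum_{i=1}^n a_{\lambda_c i} b_{\lambda_c i} k_i g_i^2 + \sum_{i=1}^n a_{\lambda_c i}^2,
\end{aligned}
\tag{2.12}
\]

where the last equality uses the fact $b_{\lambda i} = \lambda k_i/(1 + \lambda k_i) = \lambda k_i a_{\lambda i}$. To obtain the
asymptotic orders, we need to know how $\lambda_c$, the central smoothing parameter,
evolves as the sample size gets large. According to definition (2.6),
$\lambda_c$ satisfies the normal equation $\frac{\partial}{\partial\lambda} l_\lambda^{(p,q)}(E\{\mathbf{z}^{2/q}\})\big|_{\lambda=\lambda_c} = 0$, which (through
some algebra) can be written as

\[
\sum_i a_{\lambda_c i} b_{\lambda_c i}^{p/q} \left( c_q E\{z_i^{2/q}\} - 1 \right) = \sum_i a_{\lambda_c i} b_{\lambda_c i}^{(p-1)/q} - \sum_i a_{\lambda_c i} b_{\lambda_c i}^{p/q}. \tag{2.13}
\]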

The following lemma gives the order of the left-hand side of (2.13).

Lemma 2.2. Under mild regularity conditions, for $p \ge q$, $\sum_i a_{\lambda_c i} b_{\lambda_c i}^{p/q}
(c_q E\{z_i^{2/q}\} - 1) = O(\lambda_c)$.

The regularity conditions and the proof of Lemma 2.2 are given in the
Appendix. The proof uses one handy result of Demmler and Reinsch (1975),
where by studying the oscillation of the smoothing-spline eigenvectors, it is
effectively shown that for any curve $f(x)$ satisfying $0 < \int_0^1 f''(t)^2\,dt < \infty$,

\[
0 < \sum_{i=1}^n k_i g_i^2 \le \frac{1}{\sigma^2}\int_0^1 f''(t)^2\,dt < \infty \qquad \text{for all } n \ge 3. \tag{2.14}
\]

See also Speckman (1983, 1985) and Wahba (1985). For the right-hand side
of (2.13), the following theorem, taken from Kou (2003), is useful.

Theorem 2.3. Suppose $n/\lambda \to \infty$ and $n^3\lambda \to \infty$. Then for $r > 1/4$, $s > -1/4$,

\[
\sum_i a_{\lambda i}^r b_{\lambda i}^s = \frac{1}{4\pi}\left(\frac{n}{\lambda}\right)^{1/4} B\!\left(r - \frac{1}{4},\, s + \frac{1}{4}\right) + o\!\left(\left(\frac{n}{\lambda}\right)^{1/4}\right),
\]

where the beta function $B(x, y) = \Gamma(x)\Gamma(y)/\Gamma(x + y)$.
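
A beta-function approximation of this kind can be probed numerically. The sketch below rests on an assumption that is not taken from the paper: the eigenvalues are replaced by the idealized sequence $k_i = \pi^4 i^4/n$, for which an integral comparison gives the leading term $(4\pi)^{-1}(n/\lambda)^{1/4}B(r - 1/4, s + 1/4)$. The function names and the values of `n`, `lam`, `r` and `s` are illustrative.

```python
import math

def beta(x, y):
    """Beta function B(x, y) = Gamma(x) Gamma(y) / Gamma(x + y)."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)

def exact_sum(n, lam, r, s):
    """sum_i a_i^r b_i^s under the assumed eigenvalues k_i = pi^4 i^4 / n."""
    total = 0.0
    for i in range(1, n + 1):
        x = lam * math.pi ** 4 * i ** 4 / n   # lam * k_i
        a = 1.0 / (1.0 + x)
        total += a ** r * (1.0 - a) ** s
    return total

def beta_approximation(n, lam, r, s):
    """Leading term (1 / (4 pi)) * (n / lam)^(1/4) * B(r - 1/4, s + 1/4)."""
    return (n / lam) ** 0.25 * beta(r - 0.25, s + 0.25) / (4.0 * math.pi)

n, lam = 100_000, 1e-3
ratio = exact_sum(n, lam, r=1.0, s=1.0) / beta_approximation(n, lam, r=1.0, s=1.0)
```

Under the assumed eigenvalues the ratio is close to one once $(n/\lambda)^{1/4}$ is moderately large; with actual spline eigenvalues the agreement would only be asymptotic.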



Applying this result, the right-hand side of (2.13) is of order $(n/\lambda_c)^{1/4}$.
Matching it with the result of Lemma 2.2 gives

\[
\lambda_c^{(p,q)} = O(n^{1/5}) \qquad \text{for all } p \ge q, \tag{2.15}
\]

which furthermore implies (taking $r = 1$, $s = 0$ in Theorem 2.3)

\[
df_c^{(p,q)} = O\!\left(\left(\frac{n}{\lambda_c}\right)^{1/4}\right) = O(n^{1/5}) \qquad \text{for } p \ge q. \tag{2.16}
\]
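
The matching step behind (2.15) and (2.16) is short enough to write out. The display below is a sketch of the order-of-magnitude argument only; the symbol $\asymp$ ("of the same order") is added here and is not the paper's notation, and the bound of Lemma 2.2 is treated as attained.

```latex
\lambda_c \asymp \left(\frac{n}{\lambda_c}\right)^{1/4}
\;\Longrightarrow\; \lambda_c^{5/4} \asymp n^{1/4}
\;\Longrightarrow\; \lambda_c \asymp n^{1/5},
\qquad
df_c \asymp \left(\frac{n}{\lambda_c}\right)^{1/4} \asymp \left(n^{4/5}\right)^{1/4} = n^{1/5}.
```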



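Note that (2.15) and (2.16) cover GML, Cp and EE, since all three satisfy $p \ge
q$. With the help of Theorem 2.3 and (2.15), we can calculate the asymptotic
order of the bias term $E\|\hat{\mathbf{g}}_{\lambda_c} - \mathbf{g}\|^2 - E\|\hat{\mathbf{g}}_{\lambda_0} - \mathbf{g}\|^2$. By inequality (2.14), the
first term of (2.12)

\[
\lambda_c \sum_{i=1}^n a_{\lambda_c i} b_{\lambda_c i} k_i g_i^2 \le \lambda_c \sum_{i=1}^n k_i g_i^2 = O(\lambda_c) = O(n^{1/5}),
\]

and (from Theorem 2.3) the second term of (2.12) $\sum_{i=1}^n a_{\lambda_c i}^2 = O((n/\lambda_c)^{1/4}) =
O(n^{1/5})$. Adding them together yields

\[
E\|\hat{\mathbf{g}}_{\lambda_c^{(p,q)}} - \mathbf{g}\|^2 = O(n^{1/5}). \tag{2.17}
\]

Identical treatment of the ideal smoothing parameter $\lambda_0$ gives

\[
E\|\hat{\mathbf{g}}_{\lambda_0} - \mathbf{g}\|^2 = O(n^{1/5}). \tag{2.18}
\]

Combining the results of (2.17) and (2.18), we observe that for a "general"
criterion $\hat\lambda^{(p,q)}$ the bias term $E\|\hat{\mathbf{g}}_{\lambda_c^{(p,q)}} - \mathbf{g}\|^2 - E\|\hat{\mathbf{g}}_{\lambda_0} - \mathbf{g}\|^2 = O(n^{1/5})$.
We put quotation marks on "general" because there is one exception:
Cp. In Theorem 2.1 we have shown that $\lambda_c^{(2,1)} = \lambda_0$, which implies that
$E\|\hat{\mathbf{g}}_{\lambda_c^{(2,1)}} - \mathbf{g}\|^2 - E\|\hat{\mathbf{g}}_{\lambda_0} - \mathbf{g}\|^2 = 0$. The following theorem summarizes the
discovery and extends the result to the variability and covariance terms in
the decomposition.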

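Theorem 2.4. Under mild regularity conditions provided in the Appendix,
for all $p \ge q$:

(i) the bias term

\[
E\|\hat{\mathbf{g}}_{\lambda_c^{(p,q)}} - \mathbf{g}\|^2 - E\|\hat{\mathbf{g}}_{\lambda_0} - \mathbf{g}\|^2 =
\begin{cases}
O(n^{1/5}), & \text{if } (p,q) \ne (2,1), \\
0, & \text{if } (p,q) = (2,1);
\end{cases}
\]

(ii) the covariance term $E(\hat{\mathbf{g}}_{\lambda_c^{(p,q)}} - \mathbf{g})'(\hat{\mathbf{g}}_{\hat\lambda^{(p,q)}} - \hat{\mathbf{g}}_{\lambda_c^{(p,q)}}) = O(1)$;

(iii) the variability term $E\|\hat{\mathbf{g}}_{\hat\lambda^{(p,q)}} - \hat{\mathbf{g}}_{\lambda_c^{(p,q)}}\|^2 = O(1)$.

Therefore, the extra risk

\[
E\|\hat{\mathbf{g}}_{\hat\lambda^{(p,q)}} - \mathbf{g}\|^2 - E\|\hat{\mathbf{g}}_{\lambda_0} - \mathbf{g}\|^2 =
\begin{cases}
O(n^{1/5}), & \text{if } (p,q) \ne (2,1), \\
O(1), & \text{if } (p,q) = (2,1).
\end{cases}
\]

The regularity conditions and the proof of Theorem 2.4 are given in the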
Appendix. From Theorem 2.4 we observe that in general the bias term 
asymptotically dominates the other two. It is the unbiasedness of Cp that 
gives it the asymptotic advantage. In other words, when one compares the
asymptotic prediction error for different criteria, essentially the comparison
is focused on the bias, and as far as asymptotics is concerned the variability
of the criteria does not matter much. Theorem 2.4, therefore, provides
an understanding of the gap between finite-sample and asymptotic results
regarding selection criteria. Since the asymptotic comparison essentially
focuses on the bias and Cp is unbiased, it is not surprising that the high
variability of Cp evident in finite-sample studies does not show up in the
large-sample considerations. Furthermore, (2.18) and Theorem 2.4 say that
for all three selection criteria of interest, GML, Cp and EE, the averaged
prediction error $\frac{1}{n}E\|\hat{\mathbf{g}}_{\hat\lambda} - \mathbf{g}\|^2$ is of order $O(n^{-4/5})$, an order familiar to many
nonparametric problems. Speckman and Sun (2001) studied the asymptotic
properties of selection criteria; they showed that GML- and Cp-estimated
smoothing parameters have the same convergence rate, which, from a different
angle, conveys a message similar to Theorem 2.4.



3. A geometric bridge between the finite-sample and asymptotic results. 

In this section, to obtain an intuitive complement to the result of Section 
2, we provide a geometric explanation of why the finite-sample variability 
does not show up in the asymptotics. 

3.1. The geometry of selection criteria. The fact that $\hat\lambda^{(p,q)}$ chooses $\lambda$ as
the minimizer of $l_\lambda^{(p,q)}$ implies that $\hat\lambda^{(p,q)}$ must satisfy the normal equation
$\frac{\partial}{\partial\lambda} l_\lambda^{(p,q)}(\mathbf{z}^{2/q})\big|_{\lambda=\hat\lambda^{(p,q)}} = 0$, which (through simple algebra) can be written as

\[
(\dot{\boldsymbol\eta}_\lambda^{(p,q)})'(\mathbf{z}^{2/q} - \boldsymbol\mu_\lambda^{(p,q)})\big|_{\lambda=\hat\lambda^{(p,q)}} = 0, \tag{3.1}
\]

where the vector $\dot{\boldsymbol\eta}_\lambda^{(p,q)} = (\dot\eta_{\lambda 1}^{(p,q)}, \dot\eta_{\lambda 2}^{(p,q)}, \ldots, \dot\eta_{\lambda n}^{(p,q)})'$ has $i$th component equal, up to
a positive factor not depending on $i$, to $-a_{\lambda i}(c_q b_{\lambda i}^{1/q})^p$, and
$\boldsymbol\mu_\lambda^{(p,q)} = (\mu_{\lambda 1}^{(p,q)}, \mu_{\lambda 2}^{(p,q)}, \ldots, \mu_{\lambda n}^{(p,q)})'$ with $\mu_{\lambda i}^{(p,q)} = 1/(c_q b_{\lambda i}^{1/q})$. This normal equation
representation suggests a simple geometric interpretation of the $\hat\lambda^{(p,q)}$
criterion. For a given observation $\mathbf{z}$, the smoothing parameter is chosen by
projecting $\mathbf{z}^{2/q}$ onto the line $\{\boldsymbol\mu_\lambda^{(p,q)} : \lambda \ge 0\}$ orthogonally to the direction
$\dot{\boldsymbol\eta}_\lambda^{(p,q)}$. Figure 1 diagrams the geometry two-dimensionally.
In Figure 1 $\mathcal{L}_\lambda^{(p,q)}$ is the hyperplane $\mathcal{L}_\lambda^{(p,q)} = \{\mathbf{z} : (\dot{\boldsymbol\eta}_\lambda^{(p,q)})'(\mathbf{z}^{2/q} - \boldsymbol\mu_\lambda^{(p,q)}) = 0\}$.

Finding the specific hyperplane $\mathcal{L}_\lambda^{(p,q)}$ that passes through $\mathbf{z}^{2/q}$ is equivalent
to solving (3.1). It is noteworthy from Figure 1 that different hyperplanes
$\mathcal{L}_\lambda^{(p,q)}$ are not parallel, but rather intersect each other, while points on the
intersection of two hyperplanes satisfy both normal equations. This
phenomenon is termed the reversal effect in Efron (2001) and Kou and Efron
(2002). Figure 2 provides an illustration, showing one hyperplane $\mathcal{L}_{\lambda_0}^{(p,q)}$
intersecting a nearby hyperplane $\mathcal{L}_{\lambda_0+d\lambda}^{(p,q)}$ (for a small $d\lambda$).

Fig. 1. The geometry of selection criteria. Two coordinates $z_i$ and $z_j$ ($i < j$) are
indicated here.

Fig. 2. Illustration of the reversal effect caused by the rotation of the orthogonal
directions.

Intuitively, if an observation falls beyond the intersection (i.e., in the
reversal region), the selection criterion $\hat\lambda^{(p,q)}$ then will have a hard time
assigning the smoothing parameter. Furthermore, we observe that for $\hat\lambda^{(p,q)}$,
if the direction $\dot{\boldsymbol\eta}_\lambda^{(p,q)}$ rotates very fast, the reversal region will then be quite
large, causing the criterion to have a high chance of encountering observations
falling into the reversal region. This reversal effect is the main factor
behind Cp's unstable finite-sample behavior, because the Cp orthogonal
direction $\dot{\boldsymbol\eta}_\lambda^{(2,1)}$ rotates much faster than both the EE direction $\dot{\boldsymbol\eta}_\lambda^{(3/2,3/2)}$ and
the GML direction $\dot{\boldsymbol\eta}_\lambda^{(1,1)}$ [Kou and Efron (2002)]. It is worth pointing out that the
geometry and the reversal effect do not involve asymptotics. Thus, in finite
samples, the faster rotation of $\dot{\boldsymbol\eta}_\lambda^{(2,1)}$ costs Cp much more instability
than the EE and GML criteria, undermining its competitiveness.
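
The link between the criterion and its normal equation can be verified by a finite-difference check. The sketch below is illustrative only: it assumes the $p > 1$ form of the criterion with penalty $\frac{p}{p-1}(b^{(p-1)/q} - 1)$, uses $\mu_{\lambda i} = 1/(c_q b_{\lambda i}^{1/q})$, and takes hypothetical values for `u` (standing for $z_i^{2/q}$), `k` and `lam`.

```python
import math

def c_q(q):
    return math.sqrt(math.pi) / (2.0 ** (1.0 / q) * math.gamma(0.5 + 1.0 / q))

def b_of(lam, ki):
    return lam * ki / (1.0 + lam * ki)

def criterion(lam, u, k, p, q):
    """The (p, q) criterion for p > 1, evaluated at u = z^(2/q)."""
    return sum(
        c_q(q) * b_of(lam, ki) ** (p / q) * ui
        - p / (p - 1.0) * (b_of(lam, ki) ** ((p - 1.0) / q) - 1.0)
        for ui, ki in zip(u, k)
    )

def mu(lam, k, q):
    """Points mu_i = 1 / (c_q b_i^(1/q)) on the curve that u is projected onto."""
    return [1.0 / (c_q(q) * b_of(lam, ki) ** (1.0 / q)) for ki in k]

def normal_equation(lam, u, k, p, q):
    """Closed form of d/dlam criterion: (p/(q lam)) sum_i a_i c_q b_i^(p/q) (u_i - mu_i)."""
    total = 0.0
    for ui, ki, mi in zip(u, k, mu(lam, k, q)):
        b = b_of(lam, ki)
        total += (1.0 - b) * c_q(q) * b ** (p / q) * (ui - mi)
    return p / (q * lam) * total

u = [0.5, 2.0, 1.2]      # hypothetical values of z_i^(2/q)
k = [1.0, 4.0, 9.0]      # hypothetical positive eigenvalues
lam, h = 0.7, 1e-6

def finite_difference(p, q):
    return (criterion(lam + h, u, k, p, q) - criterion(lam - h, u, k, p, q)) / (2.0 * h)
```

The derivative vanishes exactly when `u` is orthogonal (in the weighted sense above) to its displacement from `mu`, which is the projection picture described in the text.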

3.2. The evolution of the geometry. The geometric interpretation nat- 
urally suggests we investigate the evolution of the reversal effect (i.e., the 
geometric instability) as the sample size grows large to bridge the gap be- 
tween finite- and large-sample results. There are two ways to quantify the 
geometric instability. First, since the root of instability is the rotation of the 
orthogonal directions, the curvature of the directions, which captures how 
fast they rotate, is a measure of the geometric instability. Second, one can 
investigate the probability that an observation falls into the reversal region, 
which directly measures how large the reversal effect is. 

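For the orthogonal direction $\dot{\boldsymbol\eta}_\lambda^{(p,q)}$ its statistical curvature [Efron (1975)],
which measures the speed of rotation, is defined by

\[
\gamma_\lambda = \left( \frac{\det(\mathbf{M}_\lambda)}{(\dot{\boldsymbol\eta}_\lambda' \mathbf{V}_\lambda \dot{\boldsymbol\eta}_\lambda)^3} \right)^{1/2}
\qquad \text{with} \qquad
\mathbf{M}_\lambda = \begin{pmatrix}
\dot{\boldsymbol\eta}_\lambda' \mathbf{V}_\lambda \dot{\boldsymbol\eta}_\lambda & \dot{\boldsymbol\eta}_\lambda' \mathbf{V}_\lambda \ddot{\boldsymbol\eta}_\lambda \\
\ddot{\boldsymbol\eta}_\lambda' \mathbf{V}_\lambda \dot{\boldsymbol\eta}_\lambda & \ddot{\boldsymbol\eta}_\lambda' \mathbf{V}_\lambda \ddot{\boldsymbol\eta}_\lambda
\end{pmatrix},
\]

where $\ddot{\boldsymbol\eta}_\lambda^{(p,q)} = \frac{\partial}{\partial\lambda}\dot{\boldsymbol\eta}_\lambda^{(p,q)}$, and the matrix $\mathbf{V}_\lambda$ is diagonal. For
the selection criteria class (2.5), Kou and Efron (2002) gave an explicit expression for the
squared statistical curvature $\gamma_\lambda^2$.

Theorem 3.1. The curvature evaluated at the ideal smoothing parameter
$\lambda_0$ has the asymptotic order $\gamma_{\lambda_0} = O(n^{-1/10})$.

Proof. According to Theorem 2.3, $\gamma_{\lambda_0}^2 = O((n/\lambda_0)^{-1/4})$, which is $O(n^{-1/5})$
by (2.15). □

Theorem 3.1 says that, first, for the selection criteria class (2.5), geometrically
as the sample size gets larger and larger, the orthogonal directions
will rotate more and more slowly, which will make the geometric instability
smaller and smaller; second, for different selection criteria, the curvature
decreases at the same order.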

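Next, we consider the probability of an observation falling into the reversal
region. Following Kou and Efron (2002), the reversal region (i.e., the region
beyond the intersection of different hyperplanes) is defined as

\[
\text{reversal region} = \{\mathbf{z} : R_0(\mathbf{z}) < 0\},
\]

where the function $R_0(\mathbf{z})$ is given by $R_0(\mathbf{z}) = \ddot l_{\lambda_0}^{(p,q)}(\mathbf{z}^{2/q}) - \beta_{\lambda_0}\dot l_{\lambda_0}^{(p,q)}(\mathbf{z}^{2/q})$
with $l_\lambda^{(p,q)}$ defined in (2.4), $\dot l_\lambda^{(p,q)} = \frac{\partial}{\partial\lambda} l_\lambda^{(p,q)}$, $\ddot l_\lambda^{(p,q)} = \frac{\partial^2}{\partial\lambda^2} l_\lambda^{(p,q)}$, and $\beta_{\lambda_0}$ a constant
depending on $\lambda_0$, $p$ and $q$.

Theorem 3.2. Under mild regularity conditions, the probability that an
observation will fall into the reversal region satisfies

\[
P(R_0(\mathbf{z}) < 0) - \Phi(\tau_n^{(p,q)}) \to 0 \qquad \text{as } n \to \infty,
\]

where $\Phi$ is the standard normal c.d.f. and for all $p \ge q$ the sequence $\tau_n^{(p,q)} =
O(n^{1/10}) < 0$.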

The regularity conditions and proof are deferred to the Appendix. Theorems
3.1 and 3.2 point out that as the sample size $n$ grows large, the reversal
effect, which is the source of Cp's instability, decreases at the same rate for
all $(p, q)$-estimators and eventually vanishes. This uniform rate is particularly
beneficial for Cp, because in finite samples Cp suffers from
the reversal effect a lot more than the other criteria, such as GML and EE.
Theorems 3.1 and 3.2 thus explain geometrically why the high variability
of Cp observed by many authors in finite-sample studies does not hurt it as
far as asymptotics is concerned.
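
How slowly a normal tail probability with argument of order $-n^{1/10}$ vanishes can be seen with a few lines of code. The sketch below is purely illustrative: the constant `c` multiplying $n^{1/10}$ is hypothetical (the theorem fixes only the order), so the printed magnitudes carry no meaning beyond the rate of decay.

```python
import math

def normal_cdf(t):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def reversal_probability(n, c=1.0):
    """Phi(tau_n) with the hypothetical choice tau_n = -c * n**(1/10)."""
    return normal_cdf(-c * n ** 0.1)

sample_sizes = [61, 121, 241, 481, 961, 1921, 3841]
probabilities = [reversal_probability(n) for n in sample_sizes]
```

With `c = 1`, going from $n = 61$ to $n = 3841$ lowers the probability by less than a factor of ten, which matches the qualitative message that the reversal effect fades slowly.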

4. A numerical illustration. In this section through a simulation experi- 
ment we will illustrate the connection between finite-sample and asymptotic 
performances of different selection criteria, focusing on Cp, GML and EE. 
The experiment starts from a small sample size and increases it gradually 
to exhibit how the performance of different selection criteria evolves as the 
sample size n grows. 

In the simulation the design points $\mathbf{x}$ are $n$ equally spaced points on the
$[-1, 1]$ interval, where the sample size $n$ starts at 61, and increases to 121,
241, ..., until 3841. For each value of $n$, 1000 data sets are generated from
the curve $f(x) = \sin(\pi(x + 1))/(x/2 + 1)$ shown in Figure 3 with noise level
$\sigma = 1$. The Cp, GML and EE criteria are applied to the simulated data to
choose the smoothing parameter (hence the degrees of freedom), which is
subsequently used to estimate the curve.
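
The data-generating step of this experiment can be sketched as follows. This is a minimal illustration rather than the author's code: it assumes the equally spaced grid includes both endpoints of $[-1, 1]$, assumes Gaussian noise, and uses hypothetical function names and a fixed seed; the smoothing-spline fit and the selection criteria are not reproduced.

```python
import math
import random

def true_curve(x):
    """f(x) = sin(pi (x + 1)) / (x / 2 + 1), the curve stated in the text."""
    return math.sin(math.pi * (x + 1.0)) / (x / 2.0 + 1.0)

def design_points(n):
    """n equally spaced points on [-1, 1]; including the endpoints is an assumption."""
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

def simulate(n, sigma=1.0, rng=None):
    """One data set y_i = f(x_i) + sigma * e_i with e_i standard normal (assumed)."""
    rng = rng or random.Random(0)
    x = design_points(n)
    y = [true_curve(xi) + rng.gauss(0.0, sigma) for xi in x]
    return x, y

# 61, 121, 241, ..., 3841: each size is 60 * 2**j + 1.
sample_sizes = [60 * 2 ** j + 1 for j in range(7)]
x, y = simulate(sample_sizes[0])
```

Repeating `simulate` 1000 times per sample size, with a fresh random stream each time, would mimic the replication scheme described above.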

The bias-variance relationship can be best illustrated by comparing the
estimated degrees of freedom (from different selection criteria) with the ideal
degrees of freedom $df_0$, since Efron (2001) suggested that the comparison
based on degrees of freedom is more sensitive. Figure 4 shows the histograms
of Cp, GML and EE estimated degrees of freedom under various sample sizes;
the vertical bar in each panel represents the ideal degrees of freedom $df_0$.

One can observe from Figure 4 that (i) Cp is roughly unbiased; (ii) as 
sample size increases, the bias of GML is gradually revealed; (iii) the large 
spread of Cp estimates points out its high variability even for sample size 
as large as 3841. The asymptotic results, overlooking the variability, in a 
certain sense reveal only part of the picture. 

Table 1 reports the squared curvature of different selection criteria under 
various sample sizes; one sees that the curvature of Cp is significantly larger 
than that of GML or EE, meaning that finite-sample-wise, Cp suffers more 
from geometric instability. Although the geometric instability (measured by 
the curvature) becomes smaller and smaller as the sample size gets larger 
and larger, it decreases quite slowly, indicating that unless the sample size 
is very large, the variability cannot be overlooked (as the asymptotics would 
do). 
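
The slow decay can be quantified directly from the entries of Table 1. The sketch below fits a least-squares line to the logarithm of the squared curvature against $\log n$; the helper `loglog_slope` is a hypothetical name, and the fit is a rough descriptive summary, not part of the paper's analysis.

```python
import math

n_values = [61, 121, 241, 481, 961, 1921, 3841]
squared_curvature = {            # values transcribed from Table 1
    "Cp":  [0.71, 0.63, 0.57, 0.51, 0.46, 0.41, 0.37],
    "GML": [0.08, 0.07, 0.06, 0.05, 0.04, 0.04, 0.03],
    "EE":  [0.29, 0.26, 0.23, 0.21, 0.19, 0.17, 0.15],
}

def loglog_slope(x, y):
    """Least-squares slope of log(y) regressed on log(x)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

slopes = {name: loglog_slope(n_values, vals) for name, vals in squared_curvature.items()}
```

The fitted slopes land roughly between $-0.15$ and $-0.25$, in the neighborhood of the exponent $-1/5$ that an $O(n^{-1/5})$ squared curvature would give; the two-decimal rounding of the table limits how much can be read into the differences.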




Fig. 3. The curve used to generate the data.

Table 1
The squared curvature of Cp, GML and EE

        n = 61   n = 121   n = 241   n = 481   n = 961   n = 1921   n = 3841
Cp       0.71     0.63      0.57      0.51      0.46      0.41       0.37
GML      0.08     0.07      0.06      0.05      0.04      0.04       0.03
EE       0.29     0.26      0.23      0.21      0.19      0.17       0.15

Fig. 4. Histograms of Cp, GML and EE estimated degrees of freedom. The vertical bar
in each panel is the ideal degrees of freedom.



Table 2
The sample mean and standard deviation of $\|\hat{\mathbf{g}}_{\hat\lambda} - \mathbf{g}\|^2$

                n = 61   n = 121   n = 241   n = 481   n = 961   n = 1921   n = 3841
Cp   mean        6.22     6.41      6.75      7.34      7.45      8.13       9.20
     std dev     4.81     4.54      4.42      4.42      4.25      4.33       4.91
GML  mean        5.90     5.68      5.91      6.61      7.01      7.85       9.10
     std dev     4.03     3.34      3.18      3.47      3.39      3.79       4.07
EE   mean        5.89     5.78      6.10      6.73      7.03      7.78       8.86
     std dev     4.04     3.34      3.33      3.53      3.49      3.83       4.08



Table 2 reports the average value and standard deviation of $\|\hat{\mathbf{g}}_{\hat\lambda^{(p,q)}} - \mathbf{g}\|^2$,
the squared estimation error, across the data sets. It is interesting to observe
that (i) the standard deviation of Cp estimates is larger than that of GML
and EE, since geometrically Cp suffers more from the reversal effect than the
other two; (ii) for small sample sizes, GML appears to work better than Cp
as the asymptotics come in rather slowly; (iii) for reasonable sample sizes
from 61 to 3841, as one usually encounters in practice, the EE criterion
appears to perform stably and well.

Comparing Table 2 with the result of Theorem 2.4, a careful reader might 
notice that this example itself illustrates the "seeming" gap: For sample size 
as large as 3841 the asymptotics are still not there. This, again, is due to 
the fact that although Cp's unbiasedness gives it an asymptotic competitive 
edge, the asymptotics come in rather slowly, and, therefore, for finite-sample 
size at hand one cannot neglect the variability, which evidently causes Cp 
more trouble than the others in Table 2. 
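
One way to see that "the asymptotics are still not there" is to compare the growth of the mean error in Table 2 with the order $n^{1/5}$ given by (2.18) and Theorem 2.4. The sketch below computes an end-to-end growth exponent from the first and last columns of the table; it is a back-of-the-envelope summary added for illustration, and the variable names are hypothetical.

```python
import math

n_first, n_last = 61, 3841
mean_error = {                   # (n = 61, n = 3841) means transcribed from Table 2
    "Cp":  (6.22, 9.20),
    "GML": (5.90, 9.10),
    "EE":  (5.89, 8.86),
}

# Exponent e such that mean error grew like n**e between the two sample sizes.
growth_exponent = {
    name: math.log(last / first) / math.log(n_last / n_first)
    for name, (first, last) in mean_error.items()
}
```

All three exponents come out well below $1/5$, so over this range of $n$ the errors have not yet settled into the asymptotic growth rate.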

5. Discussion. This article investigates the connection between finite- 
sample properties of selection criteria and their asymptotic counterparts, 
focusing on bridging the gap between the two. Through a bias-variance de- 
composition of the prediction error, it is shown that in asymptotics bias 
dominates variability, and thus the large-sample comparison essentially con- 
centrates on bias, and unintentionally overlooks the variability. As the ge- 
ometry intuitively explains how different selection criteria work, the article 
also studies the evolution of the geometric instability, the source of Cp's high 
variability, and shows that although the geometric instability decreases as 
sample size grows, it decreases very slowly so that for sample sizes one usu- 
ally encounters in practice, it cannot be neglected. We conclude the article 
with a few remarks. 

Remark 5.1. General design points. We have assumed that the design
points $x = (x_1, \ldots, x_n)$ are equally spaced along a fixed interval.
If the $x_i$ are drawn, instead, from a distribution function $G$ such that
$x_i = G^{-1}((2i-1)/n)$, then essentially all the results would remain
valid. For example, the
conclusion of Theorem 2.3 changes to 

where $g(x)$ is the density of $G$ over the domain $X$ [Kou (2003)].
Correspondingly, the asymptotic orders that we derived will remain the same
(except for longer expressions in the proofs).

Remark 5.2. Unknown $\sigma^2$. To focus on the basic ideas, we implicitly
assumed $\sigma^2$ to be known in our analysis. If $\sigma^2$ is unknown, we
can replace it with an estimate $\hat\sigma^2$, which changes (2.3) to
$\hat z = U'y/\hat\sigma = z \cdot (\sigma/\hat\sigma)$ and
$\hat z^{2/q} = z^{2/q} R$, where $R = (\sigma^2/\hat\sigma^2)^{1/q}$,
leading to the estimator
$\hat\lambda^{(p,q)} = \arg\min_\lambda \{ l_\lambda^{(p,q)}(\hat z^{2/q}) \}$,
and likewise $\widehat{df}^{(p,q)}$. If $R \sim (1, \operatorname{var} R)$ is
independent of $z^{2/q}$, it is easy to see that

(5.1) $\hat z^{2/q} \sim \bigl( E(z^{2/q}),\ \operatorname{var} z^{2/q} + \operatorname{var} R \cdot ( E(z^{2/q}) E(z^{2/q})' + \operatorname{var} z^{2/q} ) \bigr)$,

where the notation $X \sim (\alpha, \beta)$ means $X$ has mean $\alpha$ and
variance $\beta$. The extra uncertainty of $\hat\sigma^2$ makes the estimate
more variable. For example, it can be shown that



varjd/ I . , , 
varjd/ } 



Ac* Ac* 

which shows the loss of precision in $\widehat{df}^{(p,q)}$ from having to
estimate $\sigma^2$. Likewise, our results in Sections 2 and 3 can be
modified (at the expense of more complicated calculations) without changing
the conclusion. In practice, the estimate $\hat\sigma^2$ can be based on the
higher components of $U'y \sim (\sigma g, \sigma^2 I)$, for instance,

$\hat\sigma^2 = \sum_{i=n+1-M}^{n} (U'y)_i^2 / (M-2)$,

because the assumed smoothness of $f$ implies that $g_i \approx 0$ for $i$
large and that $R$ and $z^{2/q}$ are nearly independent, which makes (5.1)
valid.
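
The mean and variance in (5.1) can be checked in two lines. In the sketch below, $m = E(z^{2/q})$ and $\Sigma = \operatorname{var} z^{2/q}$ are shorthand introduced only for this check, and $R$ is assumed to have mean 1 and to be independent of $z^{2/q}$, as in the remark:

```latex
\begin{align*}
E(R\,z^{2/q}) &= E(R)\,m = m, \\
\operatorname{var}(R\,z^{2/q})
  &= E(R^2)\,E\bigl(z^{2/q}(z^{2/q})'\bigr) - mm' \\
  &= (1 + \operatorname{var} R)(\Sigma + mm') - mm' \\
  &= \Sigma + \operatorname{var} R\,(mm' + \Sigma).
\end{align*}
```

The extra term $\operatorname{var} R\,(mm' + \Sigma)$ is nonnegative definite, which is the sense in which estimating the variance adds variability.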

Remark 5.3. Higher-order smooth curves. In Section 2.3, we showed
that for general curves EE, Cp and GML give the same order $O(n^{-4/5})$
for the averaged prediction error $\frac{1}{n} E\|\hat g_{\hat\lambda} - g\|^2$.
A reader familiar with the work of Wahba (1985) might sense this as a
puzzle, because there it is shown that Cp (GCV) has a faster convergence
rate than GML. This seeming conflict actually arises from the difference in
the requirements. Wahba (1985) worked on higher-order smooth curves that
belong to the null space of the roughness penalty. In our context of cubic
smoothing splines they are the curves such that $\int f''(x)^2\,dx = 0$,
namely, straight lines. In contrast we have assumed
$\int f''(x)^2\,dx > 0$, and termed them "general curves"; see (2.14).

Remark 5.4. Generalizations of Cp and GML. A number of authors
have suggested modifying Cp or GML, including (i) general Cp, whose
criterion is $C_p(\lambda) = \|y - \hat f_\lambda\|^2 + 2\omega\sigma^2 \operatorname{tr}(A_\lambda)$,
(ii) general GCV, whose criterion is
$\mathrm{GCV}(\lambda) = \|y - \hat f_\lambda\|^2 / (1 - \omega \operatorname{tr}(A_\lambda)/n)^2$,
and (iii) a full Bayesian estimate by putting a prior on the unknown
smoothing parameter $\lambda$. Taking $\omega = 1$ in (i) and (ii) results in
the classical Cp and GCV. One can also see (through a Taylor expansion)
that (i) and (ii) are asymptotically equivalent. Using a number
$\omega > 1$ will make the estimate more stable since a heavier roughness
penalty is assigned; on the other hand, this will cause the Cp criterion to
lose its unbiasedness, since the central smoothing parameter will no longer
coincide with the ideal smoothing parameter $\lambda_0$. The finite-sample
stability is thus gained at the expense of Cp's asymptotic advantage. The
full Bayesian approach (iii) is expected to behave even more stably than
GML. An interesting open problem is to investigate how large its bias will
be and how its geometry, if possible, will evolve as sample size grows.
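
The two generalized criteria in (i) and (ii) are simple functions of the residual sum of squares and the trace of the smoother matrix. The following is a minimal sketch; the function names, the `select` helper and the example numbers are hypothetical and serve only to show how a weight larger than 1 pushes the selection toward smoother fits.

```python
def general_cp(rss, trace_a, sigma2, omega=1.0):
    """Generalized Cp: RSS + 2 * omega * sigma^2 * tr(A_lambda)."""
    return rss + 2.0 * omega * sigma2 * trace_a

def general_gcv(rss, trace_a, n, omega=1.0):
    """Generalized GCV: RSS / (1 - omega * tr(A_lambda) / n)^2."""
    denom = 1.0 - omega * trace_a / n
    if denom <= 0.0:
        raise ValueError("omega * tr(A_lambda) must be smaller than n")
    return rss / denom ** 2

def select(candidates, criterion):
    """Index of the (rss, trace_a) pair with the smallest criterion value."""
    return min(range(len(candidates)), key=lambda j: criterion(*candidates[j]))

# Hypothetical (RSS, tr(A_lambda)) pairs, ordered from roughest to smoothest fit.
fits = [(4.0, 12.0), (6.0, 6.0), (12.0, 2.0)]
classical = select(fits, lambda rss, tr: general_cp(rss, tr, sigma2=0.5, omega=1.0))
heavier = select(fits, lambda rss, tr: general_cp(rss, tr, sigma2=0.5, omega=3.0))
# classical == 1 and heavier == 2: the larger omega picks the smoother fit.
```

With `omega=1.0` both functions reduce to the classical Cp and GCV.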

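Remark 5.5. Regularity conditions. All the regularity conditions for the
theoretical results, such as Assumptions A.1-A.4 in the Appendix, can be
summarized simply as

$c_q E\{z_i^{2/q}\} \asymp 1 + \frac{1}{q} g_i^2$,
$\operatorname{var} z_i^{2/q} \asymp \text{const} + \text{const} \cdot g_i^2$,
$E\{(z_i^{2/q} - E\{z_i^{2/q}\})^3\} \asymp \text{const} + \text{const} \cdot g_i^2$.

Strict equality holds in the case of Cp and GML, where $q = 1$, $c_q = 1$:

$E\{z_i^2\} = 1 + g_i^2$,
$\operatorname{var} z_i^2 = 2 + 4 g_i^2$,
$E(z_i^2 - 1 - g_i^2)^3 = 8 + 24 g_i^2$,

which point out that the conditions are reasonably mild.

APPENDIX: REGULARITY CONDITIONS AND DETAILED PROOFS

Regularity conditions for Lemma 2.2.

Assumption A.1. $\sum_i a_{\lambda_c i} b_{\lambda_c i}^{p/q} (c_q E\{z_i^{2/q}\} - 1) = O\bigl(\sum_i a_{\lambda_c i} b_{\lambda_c i}^{p/q} g_i^2\bigr)$.

To see the validity of the assumption, we notice that $q = 1$ for Cp and
GML, and $\sum_i a_{\lambda_c i} b_{\lambda_c i}^{p/q} (c_q E\{z_i^{2/q}\} - 1)$
exactly equals $\sum_i a_{\lambda_c i} b_{\lambda_c i}^{p/q} g_i^2$.
Assumption A.1, hence, clearly holds true for Cp and GML, indicating its
mildness. The proof below provides more discussion.

Proof of Lemma 2.2. To prove the lemma, we need the following
result of Kou and Efron [(2002), Lemma 1]: For $z_i \sim N(g_i, 1)$,
$E(z_i^{2/q}) = \frac{1}{\sqrt{\pi}} 2^{1/q} \Gamma(\frac{1}{q} + \frac{1}{2}) M(-\frac{1}{q}, \frac{1}{2}, -\frac{1}{2} g_i^2)$,
where $M(\cdot,\cdot,\cdot)$ is the confluent hypergeometric function (CHF)
defined by $M(c, d, z) = 1 + \frac{cz}{d} + \cdots + \frac{(c)_n z^n}{(d)_n n!} + \cdots$,
with $(d)_n = d(d+1)\cdots(d+n-1)$. Applying the bounds of CHF [Chapter 13 of
Abramowitz and Stegun (1972)]:
$1 + \frac{1}{q} g_i^2 - \frac{1}{6q}(1 - \frac{1}{q}) g_i^4 \le M(-\frac{1}{q}, \frac{1}{2}, -\frac{1}{2} g_i^2) \le 1 + \frac{1}{q} g_i^2$,
one has

(A.1) $\frac{1}{q} g_i^2 - \frac{1}{6q}\bigl(1 - \frac{1}{q}\bigr) g_i^4 \le c_q E\{z_i^{2/q}\} - 1 \le \frac{1}{q} g_i^2$.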
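The left-hand side of (2.13) is thus bounded above by
$\frac{1}{q} \sum_i a_{\lambda_c i} b_{\lambda_c i}^{p/q} g_i^2$ and below by
$\frac{1}{q} \sum_i a_{\lambda_c i} b_{\lambda_c i}^{p/q} g_i^2 - \frac{1}{6q}(1 - \frac{1}{q}) \sum_i a_{\lambda_c i} b_{\lambda_c i}^{p/q} g_i^4$.
From (2.14), $k_i g_i^2 \le \int f''(t)^2\,dt < \infty$, suggesting that for
$n$ sufficiently large, the term $\frac{1}{q} g_i^2$ of (A.1) dominates,
which again points out that Assumption A.1 is mild.

In light of (2.14),
$\sum_i a_{\lambda_c i} b_{\lambda_c i}^{p/q} g_i^2 = \lambda_c \sum_i a_{\lambda_c i}^2 b_{\lambda_c i}^{p/q-1} (k_i g_i^2) = O(\lambda_c)$,
for $p \ge q$, which according to Assumption A.1 implies that

$\sum_i a_{\lambda_c i} b_{\lambda_c i}^{p/q} (c_q E\{z_i^{2/q}\} - 1) = O(\lambda_c)$ for $p \ge q$. □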
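
The moment formula and the bounds used in this proof can be checked numerically. The sketch below sums the CHF series directly and compares it with a brute-force integral; the function names are hypothetical, and the lower bound is taken to be the two-term truncation of the series, which is an assumption about the exact form of (A.1).

```python
import math

def chf(c, d, x, terms=200):
    """Kummer's function M(c, d, x) = sum_n (c)_n x^n / ((d)_n n!), summed directly."""
    total, term = 1.0, 1.0
    for n in range(terms):
        term *= (c + n) * x / ((d + n) * (n + 1))
        total += term
    return total

def scaled_moment(g, q):
    """c_q * E|z|^(2/q) for z ~ N(g, 1), written as M(-1/q, 1/2, -g^2/2)."""
    return chf(-1.0 / q, 0.5, -0.5 * g * g)

def scaled_moment_numeric(g, q, half_width=12.0, steps=48000):
    """The same quantity by midpoint-rule integration against the N(g, 1) density."""
    s = 2.0 / q
    c_q = math.sqrt(math.pi) / (2.0 ** (1.0 / q) * math.gamma(1.0 / q + 0.5))
    h = 2.0 * half_width / steps
    total = 0.0
    for j in range(steps):
        z = g - half_width + (j + 0.5) * h
        total += abs(z) ** s * math.exp(-0.5 * (z - g) ** 2)
    return c_q * total * h / math.sqrt(2.0 * math.pi)

def a1_bounds(g, q):
    """Assumed lower and upper bounds on c_q E{z^(2/q)} - 1, as in (A.1)."""
    upper = g * g / q
    lower = upper - (1.0 - 1.0 / q) * g ** 4 / (6.0 * q)
    return lower, upper
```

For $q = 1$ the series terminates after one term, so `scaled_moment(g, 1.0)` equals $1 + g_i^2$ exactly, matching the strict-equality case of Cp and GML.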

To prove Theorem 2.4, we need the following approximation. 
Lemma A.l. 

E^ll- - l|2 



(A.2) 



Q2ji^{z2/.}) 



^(gA, -g)'(gj, -gAj 



(A.3) 



QA.(i?{^2/n) 

X Z^flA^iOAci (aA,iCOv(Zi ) -5iCOv(Zi,Zj )), 



where the function (5a(u) is defined by 
(A.4) QA(u) = ^aA.6lr'^^'{-«A.+ 



1 + ^^ aA^-2 



(c,6i{V-l)}. 



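Derivation of Lemma A.1. Since $\hat\lambda$ by definition is a function of
$u = z^{2/q}$, and $\lambda_c$ is a function of $E\{z^{2/q}\}$, applying a
Taylor expansion on $a_{\hat\lambda i} - a_{\lambda_c i}$, we obtain

$a_{\hat\lambda i} - a_{\lambda_c i} \approx -\frac{a_{\lambda_c i} b_{\lambda_c i}}{\lambda_c} \sum_j \frac{\partial \hat\lambda}{\partial u_j}\Big|_{u = E\{z^{2/q}\}} \bigl(z_j^{2/q} - E\{z_j^{2/q}\}\bigr)$.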






Some algebra, after applying the implicit function calculation to the defini- 
tion (2.5) of A or equivalently to the normal equation (3.1), yields -§^\u=e{z'^/i} 



\ i.p/ <? 

-XcCqaxjb^. 



which then gives 



(A.5) g,^ - = (a,, - a,Jz. ^ ^^0^Y.-^A>T " M"^)' 



3 



The fact that the $z_i$ are independent of each other implies



E2 u 



2 , 2p/g ^^^^ 2/q 



var z- 



Summing over i yields the approximation 



The approximation of $E(\hat g_{\lambda_c} - g)'(\hat g_{\hat\lambda} - \hat g_{\lambda_c})$
can be obtained in a similar way. □
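
The factor $a_{\lambda_c i} b_{\lambda_c i}/\lambda_c$ in the Taylor expansion of this derivation comes from differentiating the shrinkage factor. Assuming the usual canonical form $a_{\lambda i} = 1/(1 + \lambda k_i)$ and $b_{\lambda i} = 1 - a_{\lambda i} = \lambda k_i/(1 + \lambda k_i)$ (these definitions are not restated in this section and are taken here as assumptions), the step can be written out as:

```latex
\begin{align*}
\frac{\partial a_{\lambda i}}{\partial \lambda}
  &= -\frac{k_i}{(1 + \lambda k_i)^2}
   = -\frac{1}{\lambda} \cdot \frac{1}{1 + \lambda k_i} \cdot \frac{\lambda k_i}{1 + \lambda k_i}
   = -\frac{a_{\lambda i} b_{\lambda i}}{\lambda}, \\
a_{\hat\lambda i} - a_{\lambda_c i}
  &\approx -\frac{a_{\lambda_c i} b_{\lambda_c i}}{\lambda_c}\,(\hat\lambda - \lambda_c).
\end{align*}
```

Expanding $\hat\lambda - \lambda_c$ to first order in $u - E\{z^{2/q}\}$ then brings in the partial derivatives of $\hat\lambda$ with respect to the $u_j$.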

Before proving Theorem 2.4, we state its regularity conditions. Theo- 
rem 2.4 needs the following assumptions, in addition to Assumption A.l. 



Assumption A. 2. 



pli 

Xci 



1 + ^ )aw-2 



Xri 



axci - 2 



91], 



2 j2p/q 2/q 

^ var Zi 



Xci 



:0(^max(^5:ai^,6j^Eai^,fe£V^ 

Assumption A.3. j:MA7^'yE{zrz^-/'^} = 0(max(E.(ai^,6i+^/^)',E.(aL.6l?/^)'ff.^)),
for $l, m, n \in \{1, 2\}$.

Like Assumption A.2, these two assumptions are exactly true for GML
and Cp, since $E\{z_i^2\} = 1 + g_i^2$ and $\operatorname{var}(z_i^2) = 2 + 4 g_i^2$.
In general, a Taylor expansion on the CHF can show for $q \ge 1$,

(A.6) $\operatorname{var} z_i^{2/q} = \text{const}_1 + \text{const}_2 \cdot g_i^2 + O(g_i^4)$,

which suggests that the assumptions are mild.

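Proof of Theorem 2.4. Write

Term A $= \Bigl(\sum_i a_{\lambda_c i}^2 b_{\lambda_c i}^2 (g_i^2 + 1)\Bigr) \Bigl(\sum_i a_{\lambda_c i}^2 b_{\lambda_c i}^{2p/q} \operatorname{var} z_i^{2/q}\Bigr)$,

Term B $= \sum_i a_{\lambda_c i}^4 b_{\lambda_c i}^{2 + 2p/q} E\bigl[(z_i^2 - g_i^2 - 1)(z_i^{2/q} - E\{z_i^{2/q}\})^2\bigr]$;

then approximation (A.2) becomes

(A.7) $E\|\hat g_{\hat\lambda} - \hat g_{\lambda_c}\|^2 \approx \frac{c_q^2}{Q_{\lambda_c}^2(E\{z^{2/q}\})} (\text{Term A} + \text{Term B})$.

For Term A, note that according to Assumption A.2 the order of
$\sum_i a_{\lambda_c i}^2 b_{\lambda_c i}^{2p/q} \operatorname{var} z_i^{2/q}$
is the maximum of $\sum_i a_{\lambda_c i}^2 b_{\lambda_c i}^{2p/q}$ and
$\sum_i a_{\lambda_c i}^2 b_{\lambda_c i}^{2p/q} g_i^2$. But
$\sum_i a_{\lambda_c i}^2 b_{\lambda_c i}^{2p/q} = O((n/\lambda_c)^{1/4}) = O(n^{1/5})$
by Theorem 2.3, and
$\sum_i a_{\lambda_c i}^2 b_{\lambda_c i}^{2p/q} g_i^2 = O(\lambda_c) = O(n^{1/5})$.
So $\sum_i a_{\lambda_c i}^2 b_{\lambda_c i}^{2p/q} \operatorname{var} z_i^{2/q} = O(n^{1/5})$.
Next observe that
$\sum_i a_{\lambda_c i}^2 b_{\lambda_c i}^2 (g_i^2 + 1) = \sum_i a_{\lambda_c i}^2 b_{\lambda_c i}^2 g_i^2 + \sum_i a_{\lambda_c i}^2 b_{\lambda_c i}^2$;
the first term is equal to
$\lambda_c \bigl(\sum_i a_{\lambda_c i}^3 b_{\lambda_c i} (k_i g_i^2)\bigr) = O(\lambda_c) = O(n^{1/5})$;
the second term is of order $O((n/\lambda_c)^{1/4}) = O(n^{1/5})$. Therefore,
Term A $= O(n^{1/5} \cdot n^{1/5}) = O(n^{2/5})$.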

For Term B, a Taylor expansion on the CHF gives

(A.8) $E\bigl[(z_i^2 - g_i^2 - 1)(z_i^{2/q} - E\{z_i^{2/q}\})^2\bigr] = \text{const} + \text{const} \cdot g_i^2 + O(g_i^4)$,

which, together with Assumption A.3, implies that the order of Term B is
the maximum of $O(\lambda_c) = O(n^{1/5})$ and
$O((n/\lambda_c)^{1/4}) = O(n^{1/5})$. Thus Term B $= O(n^{1/5})$.

Using Assumption A. 2, the denominator in (A.7) 

Qx^Ei-'^"}) = T.->^Af^^'{l-x. + (b'd - 1) [(l + ^)«A.. - 2] } 



l + -]ax^^-2 



{c,E{z'J'^} - 1) 
(A.9)

Plugging (A.9) and the orders of Term A and Term B into (A.7) yields

$E\|\hat g_{\hat\lambda} - \hat g_{\lambda_c}\|^2 = O(1)$.

For the covariance term E{gx^ — g)'(gj^ — gx^), we can write 

^^(gA. - g)'(gi, - gAj = ^--^^^^(TermC + TermD), 

i 

i 

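Applying Assumption A.3 and the facts that

$\operatorname{cov}(z_i^2, z_i^{2/q}) = \text{const} + \text{const} \cdot g_i^2 + O(g_i^4)$,

$g_i \operatorname{cov}(z_i, z_i^{2/q}) = \text{const} \cdot g_i^2 + O(g_i^4)$,

which can be derived similarly to (A.8), it can be shown that

Term C $= O(n^{1/5})$, Term D $= O(n^{1/5})$,

which finally gives
$E(\hat g_{\lambda_c} - g)'(\hat g_{\hat\lambda} - \hat g_{\lambda_c}) = O(1)$. □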

Regularity conditions for Theorem 3.2. 

Assumption A. 4. 

EK.<?(-.^{-'^'} - 1)] = O f E for / = 1, 2, 

E[«io.^A?varzf/^] 
i 

= O (^max oio&^ E oi^&fff) ) fo^ ^ = 2, 3, 4. 

Like the previous three assumptions, Assumption A.4 is exact for GML
and Cp. For general criteria in the class (2.5), the facts (A.6), (A.1) and
$E(z_i^{2/q} - E\{z_i^{2/q}\})^3 = \text{const} + \text{const} \cdot g_i^2 + O(g_i^4)$
suggest that Assumption A.4 is reasonably mild.

Proof of Theorem 3.2. Let $M(R_0)$ and $V(R_0)$ denote the mean and
variance of $R_0(z)$. Kou and Efron (2002) showed that
p_ 

n2 ' 



M{Ro) 



1 



p + q 

+E 



„2 h^P/l{„ 



Aqj Ao't 

Z^t "Aoi"A()t \ 



var z, 



2/q 



Using the Berry-Esseen theorem [Feller (1971), page 521], we have 



P(i?o(z) < 0) - ^{M{Rq)/JV{Rq) ) ^ as n ^ oo 



Note that we can write 



Term 1 



M{Ro) 



where 



p + q 



,/V{Ro) c,(Term3)i/ 

V^„2 Ap-l)/q 
2^ '^Aoi^Aoi 



Term 2 = ^ 



2 

Ao 

3 ?,-2/g^ 



V°Ani 



(c,ii;{zf/'^} - 1 



2/q 



and 



Term3 = ^ 



■J, i2p/q I 



2/9x2 
i "Aoi^Aoi \ 



2/9 



var z. 



2/q 



To obtain the order of Term 1, we need another result from Kou (2003):
Suppose $n/\lambda \to \infty$; then for all $r > \frac{1}{4}$ and
$s > -\frac{1}{4}$, $\sum_{i=3}^{n} a_{\lambda i}^r b_{\lambda i}^s = O((n/\lambda)^{1/4})$.
This result and Theorem 2.3 imply

(A.10) Term 1 $= O\bigl((n/\lambda_0)^{1/4}\bigr)$.



To obtain the order of Term 2, we note that by Assumption A. 4 and (2.14), 



Term2 = Ao^ 



„2 ifh-^i 



«Aoi 



V n3 A-2/9^ 



2/g 
(A.11)
For Term 3, since 

E 



0(Ao) = 0{n^'^) for all p>q. 



2 i?p/q I 



O 



n 



1/4 



and 



E 



„2 I 2p/g / 
"Aoi^Aoi pAoi 



-2/g\2 -, 



AoE 



3 ;,2p/g-lj 



flAoi 



"'Xoi"Xoi \ 



2/9 

i ""Aoi^Aoi ' 

= 0(Ao) = 0(n^/^) for all p>q, 
using Assumption A.4 we have

(A.12) Term 3 $= O(n^{1/5})$ for all $p \ge q$.

Combining (A.10)-(A.12) finally yields



vyWo 



o 



n 



1/5 



O(ni/^°)<0 for all p>^. 



□